This week we are exploring dialogue in Shakespeare’s plays!
The dataset this week comes from shakespeare.mit.edu (via github.com/nrennie/shakespeare), which is the Web’s first edition of the Complete Works of William Shakespeare. The site has offered Shakespeare’s plays and poetry to the internet community since 1993.
Dialogue from Hamlet, Macbeth, and Romeo and Juliet is provided for this week. Which play has the most stage directions compared to dialogue? Which play has the longest lines of dialogue? Which character speaks the most?
hamlet.csv
| variable | class | description |
|---|---|---|
| act | character | Act number. |
| scene | character | Scene number. |
| character | character | Name of character speaking or whether it’s a stage direction. |
| dialogue | character | Text of dialogue or stage direction. |
| line_number | double | Dialogue line number. |
macbeth.csv
| variable | class | description |
|---|---|---|
| act | character | Act number. |
| scene | character | Scene number. |
| character | character | Name of character speaking or whether it’s a stage direction. |
| dialogue | character | Text of dialogue or stage direction. |
| line_number | double | Dialogue line number. |
romeo_juliet.csv
| variable | class | description |
|---|---|---|
| act | character | Act number. |
| scene | character | Scene number. |
| character | character | Name of character speaking or whether it’s a stage direction. |
| dialogue | character | Text of dialogue or stage direction. |
| line_number | double | Dialogue line number. |
Load the data
# Load the tidytuesday package
suppressMessages(library(tidytuesdayR)) # For accessing TidyTuesday datasets
suppressMessages(library(skimr)) # For summary and descriptive statistics
suppressMessages(library(tidyverse)) # For data manipulation and visualization
suppressMessages(library(dplyr)) # For data manipulation and transformation
suppressMessages(library(ggplot2)) # For data visualization
suppressMessages(library(RColorBrewer)) # For color palettes in visualizations
suppressMessages(library(ggimage)) # For adding images to plots
suppressMessages(library(tidytext))
suppressMessages(library(sentimentr))
suppressMessages(library(ggpubr))
# Load the current week's dataset
tuesdata <- tidytuesdayR::tt_load('2024-09-17')
Downloading file 1 of 3: `hamlet.csv`
Downloading file 2 of 3: `macbeth.csv`
Downloading file 3 of 3: `romeo_juliet.csv`
# Extract datasets from the TidyTuesday dataset
hamlet <- tuesdata$hamlet
macbeth <- tuesdata$macbeth
romeo_juliet <- tuesdata$romeo_juliet
# Explore the structure of the dataset
str(hamlet) # Display the structure of 'hamlet'
spc_tbl_ [4,217 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ act : chr [1:4217] "Act I" "Act I" "Act I" "Act I" ...
$ scene : chr [1:4217] "Scene I" "Scene I" "Scene I" "Scene I" ...
$ character : chr [1:4217] "[stage direction]" "Bernardo" "Francisco" "Bernardo" ...
$ dialogue : chr [1:4217] "FRANCISCO at his post. Enter to him BERNARDO" "Who's there?" "Nay, answer me: stand, and unfold yourself." "Long live the king!" ...
$ line_number: num [1:4217] NA 1 2 3 4 5 6 7 8 9 ...
- attr(*, "spec")=
.. cols(
.. act = col_character(),
.. scene = col_character(),
.. character = col_character(),
.. dialogue = col_character(),
.. line_number = col_double()
.. )
- attr(*, "problems")=<externalptr>
str(macbeth) # Display the structure of 'macbeth'
spc_tbl_ [2,553 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ act : chr [1:2553] "Act I" "Act I" "Act I" "Act I" ...
$ scene : chr [1:2553] "Scene I" "Scene I" "Scene I" "Scene I" ...
$ character : chr [1:2553] "[stage direction]" "First Witch" "First Witch" "Second Witch" ...
$ dialogue : chr [1:2553] "Thunder and lightning. Enter three Witches" "When shall we three meet again" "In thunder, lightning, or in rain?" "When the hurlyburly's done," ...
$ line_number: num [1:2553] NA 1 2 3 4 5 6 7 8 9 ...
- attr(*, "spec")=
.. cols(
.. act = col_character(),
.. scene = col_character(),
.. character = col_character(),
.. dialogue = col_character(),
.. line_number = col_double()
.. )
- attr(*, "problems")=<externalptr>
str(romeo_juliet) # Display the structure of 'romeo_juliet'
spc_tbl_ [3,282 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ act : chr [1:3282] "Act I" "Act I" "Act I" "Act I" ...
$ scene : chr [1:3282] "Prologue" "Prologue" "Prologue" "Prologue" ...
$ character : chr [1:3282] "Chorus" "Chorus" "Chorus" "Chorus" ...
$ dialogue : chr [1:3282] "Two households, both alike in dignity," "In fair Verona, where we lay our scene," "From ancient grudge break to new mutiny," "Where civil blood makes civil hands unclean." ...
$ line_number: num [1:3282] 1 2 3 4 5 6 7 8 9 10 ...
- attr(*, "spec")=
.. cols(
.. act = col_character(),
.. scene = col_character(),
.. character = col_character(),
.. dialogue = col_character(),
.. line_number = col_double()
.. )
- attr(*, "problems")=<externalptr>
skim(hamlet) # Provide detailed summary statistics for 'hamlet' (missing values, summary stats)
Data summary

| | |
|---|---|
| Name | hamlet |
| Number of rows | 4217 |
| Number of columns | 5 |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| Group variables | None |

Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| act | 0 | 1 | 5 | 7 | 0 | 5 | 0 |
| scene | 0 | 1 | 7 | 9 | 0 | 7 | 0 |
| character | 0 | 1 | 3 | 17 | 0 | 36 | 0 |
| dialogue | 0 | 1 | 3 | 671 | 0 | 4118 | 0 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| line_number | 206 | 0.95 | 2006 | 1158.02 | 1 | 1003.5 | 2006 | 3008.5 | 4011 | ▇▇▇▇▇ |
skim(macbeth) # Provide detailed summary statistics for 'macbeth' (missing values, summary stats)
Data summary

| | |
|---|---|
| Name | macbeth |
| Number of rows | 2553 |
| Number of columns | 5 |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| Group variables | None |

Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| act | 0 | 1 | 5 | 7 | 0 | 5 | 0 |
| scene | 0 | 1 | 7 | 10 | 0 | 8 | 0 |
| character | 0 | 1 | 3 | 17 | 0 | 42 | 0 |
| dialogue | 0 | 1 | 3 | 132 | 0 | 2484 | 0 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| line_number | 169 | 0.93 | 1192.5 | 688.35 | 1 | 596.75 | 1192.5 | 1788.25 | 2384 | ▇▇▇▇▇ |
skim(romeo_juliet) # Provide detailed summary statistics for 'romeo_juliet' (missing values, summary stats)
Data summary

| | |
|---|---|
| Name | romeo_juliet |
| Number of rows | 3282 |
| Number of columns | 5 |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| Group variables | None |

Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| act | 0 | 1 | 5 | 7 | 0 | 5 | 0 |
| scene | 0 | 1 | 7 | 9 | 0 | 7 | 0 |
| character | 0 | 1 | 4 | 17 | 0 | 35 | 0 |
| dialogue | 0 | 1 | 3 | 90 | 0 | 3205 | 0 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| line_number | 189 | 0.94 | 1547 | 893.02 | 1 | 774 | 1547 | 2320 | 3093 | ▇▇▇▇▇ |
# Export data
# write.csv(hamlet, "hamlet.csv", row.names = FALSE)
# write.csv(macbeth, "macbeth.csv", row.names = FALSE)
# write.csv(romeo_juliet, "romeo_juliet.csv", row.names = FALSE)
# Combine datasets
combined_plays <- bind_rows(
  mutate(hamlet, play = "Hamlet"),
  mutate(macbeth, play = "Macbeth"),
  mutate(romeo_juliet, play = "Romeo and Juliet")
)
#write.csv(combined_plays, "combined_plays.csv", row.names = FALSE)
#tidytuesdayR::use_tidytemplate()
#### Clean the data
# Missing values occur only in line_number and are associated with stage directions
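The note about missing values can be checked directly by cross-tabulating "is a stage direction" against "line_number is missing". The sketch below uses a small made-up tibble (`toy_play` and its rows are illustrative, not the real data) so that it runs on its own; swapping in `hamlet`, `macbeth`, or `romeo_juliet` should show whether every missing line number really belongs to a stage direction.

```r
library(dplyr)

# Toy stand-in for one play (hypothetical rows, same column names as the dataset)
toy_play <- tibble(
  character   = c("[stage direction]", "Bernardo", "Francisco", "[stage direction]"),
  dialogue    = c("Enter BERNARDO", "Who's there?", "Nay, answer me", "Exit"),
  line_number = c(NA, 1, 2, NA)
)

# Count rows for each combination of the two flags
na_check <- toy_play %>%
  mutate(
    is_stage_direction  = character == "[stage direction]",
    missing_line_number = is.na(line_number)
  ) %>%
  count(is_stage_direction, missing_line_number)

na_check
```

If the two flags always agree, the table has only two rows (both FALSE, both TRUE), and the missing values need no imputation because they are structural.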
EDA
############ Wordcount by Character ############
hamlet_word_count <- hamlet %>%
  filter(!str_detect(character, "\\[stage direction\\]")) %>%
  unnest_tokens(word, dialogue) %>%
  count(character, sort = TRUE)
ggplot(hamlet_word_count, aes(x = reorder(character, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Word Count by Character in Hamlet", x = "Character", y = "Word Count")

macbeth_word_count <- macbeth %>%
  filter(!str_detect(character, "\\[stage direction\\]")) %>%
  unnest_tokens(word, dialogue) %>%
  count(character, sort = TRUE)
ggplot(macbeth_word_count, aes(x = reorder(character, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Word Count by Character in Macbeth", x = "Character", y = "Word Count")

romeo_juliet_word_count <- romeo_juliet %>%
  filter(!str_detect(character, "\\[stage direction\\]")) %>%
  unnest_tokens(word, dialogue) %>%
  count(character, sort = TRUE)
ggplot(romeo_juliet_word_count, aes(x = reorder(character, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Word Count by Character in Romeo & Juliet", x = "Character", y = "Word Count")
############ Dialogue by Scene ############
# Note: n() counts every row in a scene, stage directions included
hamlet_lines_per_scene <- hamlet %>%
  group_by(act, scene) %>%
  summarise(total_lines = n(), .groups = "drop")
ggplot(hamlet_lines_per_scene, aes(x = scene, y = total_lines, fill = act)) +
  geom_col() +
  facet_wrap(~act, scales = "free_x") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Number of Lines per Scene in Hamlet", x = "Scene", y = "Total Lines")

macbeth_lines_per_scene <- macbeth %>%
  group_by(act, scene) %>%
  summarise(total_lines = n(), .groups = "drop")
ggplot(macbeth_lines_per_scene, aes(x = scene, y = total_lines, fill = act)) +
  geom_col() +
  facet_wrap(~act, scales = "free_x") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Number of Lines per Scene in Macbeth", x = "Scene", y = "Total Lines")

romeo_juliet_lines_per_scene <- romeo_juliet %>%
  group_by(act, scene) %>%
  summarise(total_lines = n(), .groups = "drop")
ggplot(romeo_juliet_lines_per_scene, aes(x = scene, y = total_lines, fill = act)) +
  geom_col() +
  facet_wrap(~act, scales = "free_x") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Number of Lines per Scene in Romeo and Juliet", x = "Scene", y = "Total Lines")
############# Stage Direction vs Dialogue #############
hamlet_stage_vs_dialogue <- hamlet %>%
  mutate(type = ifelse(str_detect(character, "\\[stage direction\\]"), "Stage Direction", "Dialogue")) %>%
  group_by(type) %>%
  summarise(total_lines = n())
ggplot(hamlet_stage_vs_dialogue, aes(x = type, y = total_lines, fill = type)) +
  geom_col() +
  labs(title = "Stage Directions vs Dialogue in Hamlet", x = "Type", y = "Number of Lines")
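The Hamlet-only comparison above does not yet answer which play has the most stage directions relative to dialogue. One possible way to get there is to compute the ratio per play. This is a minimal sketch on a made-up tibble (`toy_plays` is hypothetical); with the real data, `combined_plays` would take its place.

```r
library(dplyr)

# Hypothetical rows standing in for combined_plays
toy_plays <- tibble(
  play      = c("Hamlet", "Hamlet", "Hamlet", "Hamlet", "Macbeth", "Macbeth"),
  character = c("[stage direction]", "Bernardo", "Francisco", "Bernardo",
                "[stage direction]", "First Witch")
)

# Stage directions per line of dialogue, one row per play
stage_share <- toy_plays %>%
  group_by(play) %>%
  summarise(
    stage_directions = sum(character == "[stage direction]"),
    dialogue_lines   = sum(character != "[stage direction]"),
    .groups = "drop"
  ) %>%
  mutate(stage_per_dialogue = stage_directions / dialogue_lines)

stage_share
```

Using a ratio rather than a raw count keeps the comparison fair, since the three plays differ considerably in length.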
############# Combined counts #############
play_word_count <- combined_plays %>%
  filter(!str_detect(character, "\\[stage direction\\]")) %>%
  unnest_tokens(word, dialogue) %>%
  count(play, character, sort = TRUE)
play_word_count_filter <- play_word_count %>%
  filter(n > 200)
# The data covers all three plays, so the title no longer names Hamlet alone
ggplot(play_word_count_filter, aes(x = reorder(character, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Word Count by Character Across Plays", x = "Character", y = "Word Count")

play_line_count <- combined_plays %>%
  filter(!str_detect(character, "\\[stage direction\\]")) %>%
  group_by(play, character, act) %>%
  summarise(line_count = n()) %>%
  filter(line_count > 100)
ggplot(play_line_count, aes(x = reorder(character, line_count), y = line_count, fill = play)) +
  geom_col() +
  coord_flip() + # Flips the axes for better readability
  labs(title = "Line Count per Character in Plays",
       x = "Character",
       y = "Line Count") +
  theme_minimal() +
  theme(legend.position = "bottom") # Adjust legend position if needed

###
play_line_count2 <- combined_plays %>%
  filter(!str_detect(character, "\\[stage direction\\]")) %>%
  group_by(play, character, act) %>%
  summarise(line_count = n()) %>%
  filter(line_count > 1)
# Define the number of top characters you want to display
top_n_characters <- 10
# Create a summary of line counts by character across all acts
top_characters <- play_line_count2 %>%
  group_by(character) %>%
  summarise(total_lines = sum(line_count)) %>%
  arrange(desc(total_lines)) %>%
  slice_head(n = top_n_characters) %>%
  pull(character)
# Filter the original data for these top characters
filtered_play_line_count <- play_line_count2 %>%
  filter(character %in% top_characters)
# Create line plots for each play with the filtered characters
# (linewidth replaces size for lines in ggplot2 3.4.0 and later)
ggplot(filtered_play_line_count, aes(x = act, y = line_count, color = character, group = character)) +
  geom_line(linewidth = 1) +
  geom_point(size = 3) +
  labs(title = 'Line Count by Top Characters Across Acts',
       x = 'Act',
       y = 'Line Count') +
  facet_wrap(~ play) +
  scale_color_brewer(palette = "Set1") + # Choose a color palette
  theme_minimal() +
  theme(legend.position = "bottom")
Plot 1 Number of Lines per Character by Play
play_line_count_char <- combined_plays %>%
  filter(!str_detect(character, "\\[stage direction\\]")) %>%
  group_by(play, character) %>%
  summarise(line_count = n(), .groups = "drop") %>%
  filter(line_count > 100) %>%
  arrange(desc(line_count))
# Random colors function
# Define a function to generate colors based on a data frame and column
generate_colors <- function(data, column) {
  num_items <- length(unique(data[[column]]))
  if (num_items <= 8) {
    return(brewer.pal(num_items, "Set3"))
  } else {
    return(colorRampPalette(brewer.pal(12, "Set3"))(num_items))
  }
}
random_colors <- generate_colors(play_line_count_char, "character")
ggplot(play_line_count_char, aes(x = play, y = line_count, fill = character)) +
  geom_bar(stat = "identity", position = "stack") +
  geom_text(aes(label = str_wrap(character, width = 10)),
            position = position_stack(vjust = 0.5), size = 4) +
  scale_fill_manual(values = random_colors) +
  labs(title = "Number of Lines per Character by Play",
       x = "Play",
       y = "Number of Lines",
       fill = "Character") +
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        plot.title = element_text(size = 14, face = "bold"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "none") +
  coord_flip()
In Hamlet, the title character dominates the play with a total of 1,495 lines. Following him are King Claudius with 546 lines and Lord Polonius with 355 lines. Notably, Ophelia has 173 lines. In Macbeth, Macbeth himself has 717 lines, establishing him as the central figure. Lady Macbeth follows with 265 lines, still a significant presence in the dialogue. Other key characters include Malcolm with 212 lines and Macduff with 180 lines. In Romeo and Juliet, Romeo leads with 612 lines, while Juliet is nearly as prominent with 544 lines. Additional important characters include Friar Laurence with 351 lines, Nurse with 281 lines, and Mercutio with 261 lines.
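Because the plays differ in length, raw line counts are easier to compare as shares of each play's dialogue. The sketch below reuses the lead-character counts quoted above and the dialogue totals implied by the skim output (rows minus missing line numbers: 4011, 2384, and 3093). Treating those totals as the number of dialogue rows is an assumption, and `lead_share` is a name introduced here for illustration.

```r
library(dplyr)

# Lead-character line counts from the text; totals assumed from the skim output
lead_share <- tibble(
  play           = c("Hamlet", "Macbeth", "Romeo and Juliet"),
  character      = c("Hamlet", "Macbeth", "Romeo"),
  line_count     = c(1495, 717, 612),
  dialogue_lines = c(4011, 2384, 3093)
) %>%
  mutate(share = line_count / dialogue_lines)

# Share of each play's dialogue spoken by its lead character
lead_share %>%
  mutate(share = round(share, 3))
```

On these figures Hamlet speaks roughly 37% of his play's dialogue lines, Macbeth about 30%, and Romeo about 20%, which makes Hamlet's dominance clearer than the raw counts alone.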
Plot 2 Average Sentiment by Character Type
############# Sentiment Analysis #############
# Convert text to lowercase
combined_plays$dialogue <- tolower(combined_plays$dialogue)
suppressMessages(library(sentimentr))
# https://github.com/trinker/sentimentr
# Filter out stage direction lines from the combined plays
filtered_plays <- combined_plays %>%
  filter(!str_detect(character, "\\[stage direction\\]"))
# Use the sentimentr package to compute sentiment scores
sentiment_results <- sentiment(filtered_plays$dialogue)
# Add a row-based `element_id` to `filtered_plays` so it matches `element_id` in `sentiment_results`
filtered_plays <- filtered_plays %>%
  mutate(element_id = row_number())
# Merge on `element_id` to combine the sentiment results with the filtered plays data
filtered_plays <- left_join(filtered_plays, sentiment_results, by = "element_id")
# Drop the helper columns; the score is already in a column named `sentiment`
filtered_plays <- filtered_plays %>%
  select(-element_id, -sentence_id, -word_count)
# View the final structure
#str(filtered_plays)
# Preview the first few rows of the cleaned data with sentiment scores
#head(filtered_plays)
# Calculate average sentiment by character, play, and act
avg_sentiment_by_character <- filtered_plays %>%
  group_by(character, play, act) %>%
  summarise(avg_sentiment = mean(sentiment, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(avg_sentiment))
# View top characters by average sentiment
#head(avg_sentiment_by_character)

############# Character Categorization #############
# Define character categories for each play using these characteristics
# Protagonists: Main characters driving the plot.
# Antagonists: Characters opposing the protagonists.
# Supporting Characters: Key secondary characters that assist the protagonists.
# Minor Characters: Less significant characters that contribute to the story.
# Hamlet
protagonists_hamlet <- c("Hamlet")
antagonists_hamlet <- c("King Claudius", "Lord Polonius")
supporting_characters_hamlet <- c("Ophelia", "Horatio", "Laertes", "Queen Gertrude")
#filtered_plays %>%
#  filter(character %in% c("Ophelia"))
# Macbeth
protagonists_macbeth <- c("Macbeth")
antagonists_macbeth <- c("Lady Macbeth")
supporting_characters_macbeth <- c("Banquo", "Duncan", "Macduff", "Malcolm", "Ross")
# Romeo and Juliet
protagonists_romeo_juliet <- c("Romeo", "Juliet")
antagonists_romeo_juliet <- c("Tybalt", "Paris")
supporting_characters_romeo_juliet <- c("Benvolio", "Mercutio", "Nurse", "Friar Laurence", "Capulet",
                                        "Lady Capulet", "Montague", "Lady Montague", "Prince")
# Combine all character categories into a list for comparison
protagonists <- c(protagonists_hamlet, protagonists_macbeth, protagonists_romeo_juliet)
antagonists <- c(antagonists_hamlet, antagonists_macbeth, antagonists_romeo_juliet)
supporting_characters <- c(supporting_characters_hamlet, supporting_characters_macbeth, supporting_characters_romeo_juliet)
# Categorize characters in filtered_plays
filtered_plays <- filtered_plays %>%
  mutate(character_type = case_when(
    character %in% protagonists ~ "Protagonist",
    character %in% antagonists ~ "Antagonist",
    character %in% supporting_characters ~ "Supporting Character",
    TRUE ~ "Minor Character" # All others are minor characters
  ))
# Calculate average sentiment by character type and play
# avg_sentiment_by_character_type <- filtered_plays %>%
#   group_by(character_type, play) %>%
#   summarize(avg_sentiment = mean(sentiment, na.rm = TRUE), .groups = "drop")
#
# # Create a bar plot
# ggplot(avg_sentiment_by_character_type, aes(x = character_type, y = avg_sentiment, fill = play)) +
#   geom_bar(stat = "identity", position = position_dodge()) +
#   theme_minimal() +
#   labs(title = "Average Sentiment by Character Type",
#        x = "Character Type",
#        y = "Average Sentiment") +
#   scale_fill_brewer(palette = "Set3") +
#   coord_cartesian(ylim = c(-0.15, 0.1)) +
#   theme(axis.title.x = element_blank(), # Remove x-axis title
#         #axis.text.y = element_blank(), # Remove y-axis text
#         plot.title = element_text(size = 14, face = "bold"),
#         panel.grid.major = element_blank(),
#         panel.grid.minor = element_blank(),
#         legend.position = "bottom")
# Box plot with facets for a different view for all characters
# ggplot(filtered_plays, aes(x = character_type, y = sentiment, fill = character_type)) +
#   geom_boxplot(outlier.shape = NA, position = position_dodge(width = 0.8)) +
#   geom_jitter(color = "black", alpha = 0.5, size = 0.5, position = position_jitter(width = 0.2)) +
#   theme_minimal() +
#   labs(title = "Distribution of Sentiment by Character Type",
#        subtitle = "with distribution of all characters within each play",
#        x = "Character Type",
#        y = "Sentiment Score") +
#   scale_fill_brewer(palette = "Set3") +
#   coord_cartesian(ylim = c(-1.5, 1.5)) +
#   theme(axis.title.x = element_blank(),
#         plot.title = element_text(size = 14, face = "bold"),
#         panel.grid.major = element_blank(),
#         panel.grid.minor = element_blank(),
#         legend.position = "bottom") +
#   facet_wrap(~ play)
# Most positive rows first
filtered_plays %>%
  arrange(desc(sentiment))
# Most negative rows first
filtered_plays %>%
  arrange(sentiment)
############# Calculate for median
lower_threshold <- -0.00
upper_threshold <- 0.00
# Calculate median sentiment by character type and play
median_sentiment_filtered <- filtered_plays %>%
  # With both thresholds at 0, this drops only rows whose sentiment is exactly zero
  filter(sentiment < lower_threshold | sentiment > upper_threshold) %>%
  group_by(character_type, play) %>%
  summarize(median_sentiment = median(sentiment, na.rm = TRUE), .groups = "drop")
ggplot(median_sentiment_filtered, aes(x = character_type, y = median_sentiment, fill = play)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  theme_minimal() +
  labs(title = "Median Sentiment by Character Type",
       x = "Character Type",
       y = "Median Sentiment") +
  scale_fill_brewer(palette = "Set3") +
  coord_cartesian(ylim = c(-0.15, 0.1)) +
  theme(axis.title.x = element_blank(), # Remove x-axis title
        #axis.text.y = element_blank(), # Remove y-axis text
        plot.title = element_text(size = 14, face = "bold"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "bottom")
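Dropping the zero scores before taking the median matters because many short lines contain no words that a sentiment lexicon scores, so zeros can dominate a group and pin its median at exactly zero. A tiny worked example with made-up scores (the `scores` vector is hypothetical) shows the effect:

```r
# Hypothetical sentiment scores for one group; half of them are exactly zero
scores <- c(-0.4, 0, 0, 0, 0.1, 0.3)

# Median over all rows: the zeros sit in the middle of the sorted values
median_all <- median(scores)

# Median over non-zero rows only, mirroring the filter used for the plot
median_nonzero <- median(scores[scores != 0])

c(all_rows = median_all, nonzero_rows = median_nonzero)
```

The trade-off is that the filtered median describes only the lines that carry any detectable sentiment, so it should be read as "tone when there is tone" rather than as the tone of a character's speech overall.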
# Box plot with facets for a different view for all characters
# ggplot(filtered_plays, aes(x = character_type, y = sentiment, fill = play)) +
#   geom_boxplot(outlier.shape = NA, position = position_dodge(width = 0.8)) +
#   geom_jitter(aes(color = character_type), alpha = 0.5, size = 0.5, position = position_jitter(width = 0.2)) + # Color by character_type
#   theme_minimal() +
#   labs(title = "Distribution of Sentiment by Play",
#        subtitle = "With Distribution of character type",
#        x = "Character Type",
#        y = "Sentiment Score") +
#   scale_fill_brewer(palette = "Set3") +
#   coord_cartesian(ylim = c(-1.5, 1.5)) +
#   theme(axis.title.x = element_blank(),
#         plot.title = element_text(size = 14, face = "bold"),
#         panel.grid.major = element_blank(),
#         panel.grid.minor = element_blank(),
#         legend.position = "bottom") +
#   facet_wrap(~ play)
############# alternate view
# Define a threshold for outliers
lower_threshold <- -1.25 # Lower bound for outliers
upper_threshold <- 1.25 # Upper bound for outliers
# Identify outliers in your data
filtered_plays <- filtered_plays %>%
  mutate(is_outlier = sentiment <= lower_threshold | sentiment >= upper_threshold)
# Create the plot
ggplot(filtered_plays, aes(x = character_type, y = sentiment, fill = play)) +
  geom_boxplot(outlier.shape = NA, position = position_dodge(width = 0.8)) +
  geom_jitter(aes(color = character_type), alpha = 0.5, size = 0.5,
              position = position_jitter(width = 0.2)) + # Color by character_type
  theme_minimal() +
  labs(title = "Distribution of Sentiment by Play",
       subtitle = "With Distribution by character type and Outliers Identified",
       x = "Character Type",
       y = "Sentiment Score") +
  scale_fill_brewer(palette = "Set3") +
  coord_cartesian(ylim = c(-1.5, 1.5)) +
  theme(axis.title.x = element_blank(),
        plot.title = element_text(size = 14, face = "bold"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "bottom") +
  facet_wrap(~ play) +
  geom_text(data = filtered_plays %>% filter(is_outlier),
            aes(label = character), # Use character names as labels
            vjust = -0.5, # Adjust vertical position of text
            color = "red", # Color for the outlier text
            size = 3) # Adjust text size as needed
The sentiment scores show that the antagonists in Romeo and Juliet display the most negative sentiment, while the protagonists of Hamlet and Macbeth are slightly positive or neutral. Supporting characters generally show more positive sentiment in Hamlet and Macbeth than the minor characters and antagonists.
In Hamlet, the protagonist shows slight positivity, which may reflect moments of introspection and depth amid his struggles with existential questions and moral dilemmas. The antagonists, King Claudius and Lord Polonius, display a negative sentiment, consistent with a morally ambiguous portrayal in which their manipulative and deceitful actions create tension against Hamlet. The supporting characters, such as Ophelia, Horatio, Laertes, and Queen Gertrude, generally have favorable sentiment scores, in keeping with their roles as emotional anchors in Hamlet’s turbulent journey.
Macbeth has a slightly negative average sentiment score, illustrating his tragic descent from a noble warrior to a tyrannical ruler consumed by ambition and guilt. This character arc highlights the play’s central themes of ambition and moral decay. The primary antagonist, Lady Macbeth, reflects a similarly dark sentiment, in line with her crucial role in driving Macbeth’s ambition and the ensuing chaos. The supporting characters, such as Banquo, Duncan, Macduff, Malcolm, and Ross, have more positive average sentiment scores. Notably, Banquo’s loyalty and moral integrity contrast sharply with Macbeth’s deteriorating character, creating a compelling dynamic that underscores the tragedy of ambition.
In Romeo and Juliet, the two protagonists show a slightly positive average sentiment score that reflects their passionate love story. Their relationship, however, is set against a backdrop of familial conflict and societal expectations, contributing to the overall tragic tone of the play. The antagonists, notably Tybalt and Paris, display a more negative sentiment. Tybalt’s aggressive nature and Paris’s socially enforced pursuit of Juliet create obstacles for the young lovers, and the antagonists’ average sentiment score underscores the intense familial and societal conflicts that frame the narrative. Supporting characters, including Benvolio, Mercutio, Nurse, and the Capulet and Montague families, have average sentiment scores that fluctuate, with some providing comic relief while others deepen the tragedy through their actions and responses to the central conflict.
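One caveat about the scores discussed above: `sentiment()` returns one row per sentence, so joining it to the dialogue table on `element_id` repeats any dialogue row that contains more than one sentence, and such rows then count more than once in a mean or median. A possible alternative is to collapse the sentence scores to one value per row before joining. The sketch below illustrates this on a hand-made table shaped like that output (`sentence_scores` and `toy_lines` are hypothetical; no sentiment is actually computed here). sentimentr also offers `sentiment_by()` for per-element aggregation.

```r
library(dplyr)

# Hypothetical sentence-level scores: element 1 contains two sentences
sentence_scores <- tibble(
  element_id  = c(1L, 1L, 2L, 3L),
  sentence_id = c(1L, 2L, 1L, 1L),
  sentiment   = c(0.5, -0.1, 0.0, -0.3)
)

# Three dialogue rows with a row-based id, as in the analysis above
toy_lines <- tibble(
  character = c("Hamlet", "Horatio", "Hamlet"),
  dialogue  = c("first row. second sentence.", "second row", "third row")
) %>%
  mutate(element_id = row_number())

# Average the sentence scores so each dialogue row gets exactly one score
line_scores <- sentence_scores %>%
  group_by(element_id) %>%
  summarise(sentiment = mean(sentiment), .groups = "drop")

scored_lines <- left_join(toy_lines, line_scores, by = "element_id")
scored_lines
```

A plain mean is only one choice of aggregation; whichever is used, the joined table keeps the same number of rows as the dialogue table, so each line contributes once to the group summaries.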
Plot 3 Language Use by Gender
############ Gendered Language Analysis ############
# Create a list of known female characters
female_characters <- c("Queen Gertrude", "Ophelia", "Player Queen", "Lady Macbeth", "Lady Macduff", "Gentlewoman",
                       "Lady Capulet", "Lady Montague", "Juliet", "Nurse", "First Witch", "Second Witch", "Third Witch")
# Assign gender
combined_plays <- filtered_plays %>%
  mutate(gender = ifelse(character %in% female_characters, "Female", "Male"))
gender_conts <- combined_plays %>%
  group_by(gender) %>% # Group by the existing gender column
  summarise(count = n()) %>% # Count occurrences
  mutate(proportion = count / sum(count))
# Unnest the dialogue into tokens
tidy_text <- combined_plays %>%
  unnest_tokens(word, dialogue)
# Join with Bing (the lexicon's column becomes sentiment.y because `sentiment` already exists)
sentiment_data <- tidy_text %>%
  inner_join(get_sentiments("bing"), by = "word")
# Group sentiments by gender
gender_sentiment <- sentiment_data %>%
  group_by(gender, sentiment.y) %>%
  summarise(word_count = n()) %>%
  ungroup()
# Create a bar plot
gender1 <- ggplot(gender_sentiment, aes(x = gender, y = word_count, fill = sentiment.y)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Sentiment by Gender",
       x = NULL,
       y = "Word Count") +
  theme(axis.title.x = element_blank(),
        plot.title = element_text(size = 14, face = "bold"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "None",
        axis.text.x = element_blank(), # Remove x-axis text
        axis.ticks.x = element_blank()) + # Remove x-axis ticks
  coord_flip() +
  geom_text(aes(label = word_count),
            position = position_dodge(width = 0.9),
            hjust = 1.3,
            color = "black", size = 4)
# Male characters express a higher count of both positive and negative words compared to female characters, which is expected with 19% of the characters being female
# Type-Token Ratio (TTR) evaluates the richness or diversity of vocabulary in a text. TTR is the total number of unique words (types) divided by the total number of words (tokens) in a segment of language.
# Calculate lexical richness by grouping by gender
lexical_richness <- tidy_text %>%
  group_by(gender) %>%
  summarise(unique_words = n_distinct(word), total_words = n()) %>%
  mutate(ttr = unique_words / total_words) # Type-Token Ratio (TTR)
# Create a bar plot for Type-Token Ratio (TTR) by gender
gender2 <- ggplot(lexical_richness, aes(x = gender, y = ttr, fill = gender)) +
  geom_bar(stat = "identity") +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal() +
  labs(title = "Language Use by Gender",
       x = NULL,
       y = "Lexical Richness") +
  theme(#axis.title.x = element_blank(),
        plot.title = element_text(size = 14, face = "bold"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        legend.position = "None",
        axis.text.x = element_blank(), # Remove x-axis text
        axis.ticks.x = element_blank()) + # Remove x-axis ticks
  coord_flip() +
  geom_text(aes(label = round(ttr, 2)),
            position = position_dodge(width = 0.9),
            hjust = 1.3,
            color = "black", size = 5)
# Female characters show a higher TTR, meaning a larger share of their words are distinct, which suggests a more varied choice of language rather than repetition of the same words.
# Male characters show a lower TTR than female characters. Although they engage in longer dialogues and may express complex thoughts, they repeat certain words and phrases more frequently.
# Even though women may be portrayed as more emotionally expressive or complex, male characters appear more focused on action or less varied in their emotional expression, leading to repetitive phrasing.
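One thing to keep in mind when reading the TTR comparison is that TTR tends to fall as a text gets longer, because common words keep recurring while new word types arrive more slowly. Since male characters speak far more words in these plays, part of their lower TTR may simply reflect sample size. The sketch below shows the effect on a made-up token vector (`ttr()` is a helper defined here, and the tokens are illustrative), and one simple way to compare groups on an equal number of tokens.

```r
# Type-token ratio: unique words divided by total words
ttr <- function(words) length(unique(words)) / length(words)

# Hypothetical tokens: a short text, and the same text repeated three times
words_short <- c("to", "be", "or", "not", "to", "be")
words_long  <- rep(words_short, 3)

ttr_short <- ttr(words_short)  # 4 types over 6 tokens
ttr_long  <- ttr(words_long)   # the same 4 types over 18 tokens

# Comparing on equal-sized samples removes the length effect
ttr_long_truncated <- ttr(head(words_long, length(words_short)))

c(short = ttr_short, long = ttr_long, long_truncated = ttr_long_truncated)
```

With the real data, a comparable check would be to draw the same number of tokens from each gender (for example, the size of the smaller group) before computing TTR, and see whether the gap between the groups persists.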